Introduction

Data set being analyzed and motivations

Heart failure currently has a 5 year mortality rate of 50% affecting 6.5 million adult Americans. Despite later-stage heart failure with reduced ejection fraction (HFrEF) showing diverse etiologies and genetic contributions they are still considered to evolve via a final common pathway. Current therapies are relatively indifferent to disease etiology and treat HFrEF due to ischemic cardiomyopathy (ICM) the same as dilated cardiomyopathy (DCM), even though ICM has a much worse prognosis than DCM. This can be an indication of an incomplete understanding of the different biological mechanisms contributing to HFrEF (Sweet et al. 2018).

This data set is from a study which looks at 64 human left ventricular samples, 37 of which are from DCM, 13 from ICM, and 14 from non-failing hearts (NF). The goal of this study was to uncover common and etiology-specific gene signatures between the three cohorts (Sweet et al. 2018).

This data set was downloaded from the Gene Expression Omnibus (GEO) and has a GEO accession of GSE116250.

Normalization and initial data exploration

In Assignment 1, the data set has been normalized using RPKM and genes with mean RPKM across all samples less than five have been filtered out to remove some noise. The normalization method of RPKM was the decision made on the authors’ part as the data set from GEO came in RPKM. Prior to removing lowly expressed genes with mean RPKM less than 5 there were 57974 genes. After removing these lowly expressed genes 12381 genes remained.

All expression values have been mapped to HUGO symbols. There were 6 specific genes that were repeated and the decision made was to keep them in the final data set. The reason being at that point it was not possible to tell if they were due to a measurement error or another type of error and those genes could have functions that are important to look at in the future. The duplicated genes all had frequency of 2 in this data set.

An important note is the normalization and data processing methods were not clearly described by the authors in their original paper. The data provided was in RPKM meaning it has been normalized by library size and the length of transcript allowing within sample comparisons.

In Assignment 1, I attempted to perform TMM normalization to see if the authors performed any normalization between samples. The hypothesis was if the data has been normalized between samples as well then the plots before TMM normalization and after TMM normalization it should be the same. The result after TMM normalization was in the density plot there was a large spike at 0 essentially making all the data at one point. This density plot does not make sense. Furthermore, this is a drastic change to the original data and is undesired. The conclusion from this initial exploration was the data was posted already normalized so further normalization methods should not be used.

Note, when normalization was tried on this RPKM data provided from the authors to use TMM it just showed one large spike at zero essentially putting all our data at essentially the same value. This drastically changed the data from pre-TMM normalization. We do not want to use normalization to drastically alter the data

Due to the results seen from the plots after normalization and lack of information in the paper about normalization methods used other than RPKM, it is concluded the data posted is already normalized. Further normalization of this data results in dramatic changes to the data that isn’t desirable. More detail about this can be found in the appendix.

Before we begin with differential gene expression analysis and preliminary ORA lets load our normalized and filtered data and include any packages we need going forward.

# Load in the normalized and filtered data from assignment 1
finalFilteredData <- "final_filtered_exp_data.rds"
expData <- readRDS(finalFilteredData)

# Some duplicate genes were kept from the normalization and initial analysis, let's remove them now as they do not seem to be significant
expData <- dplyr::distinct(expData, Gene, .keep_all = TRUE)
# Install required packages 
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
if (!requireNamespace("tibble", quietly = TRUE)) {
  install.packages("tibble")
}
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
} 
if (!requireNamespace("edgeR", quietly = TRUE)) {
  install.packages("edgeR")
}
if (!requireNamespace("limma", quietly = TRUE)) {
  install.packages("limma")
}
if(!requireNamespace("tidyr", quietly = TRUE)) {
  install.packages("tidyr")
}
if(!requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  install.packages("ComplexHeatmap")
}
if(!requireNamespace("circlize", quietly = TRUE)) {
  install.packages("circlize")
}
if(!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}
if(!requireNamespace("gprofiler2", quietly = TRUE)) {
  install.packages("gprofiler2")
}
suppressPackageStartupMessages({
  library("ggplot2")
  library("tibble")
  library("dplyr")
  library("edgeR")
  library("ComplexHeatmap")
  library("circlize")
  library("stringr")
  library("gprofiler2")
})

Differential Gene Expression

Looking at the MDS plot from Assignment 1 there is clustering among samples of the same disease cohort so disease status appears to be an important factor in our model.

Let’s create the differential expression model.

# First, we need to extract the cohort and sample number from the expData
cohorts <- gsub("[0-9]", "", colnames(expData)[3:66])
sampleNum <- gsub("[A-Z]", "", colnames(expData)[3:66])
samples <- data.frame(cohorts, sampleNum)
rownames(samples) <- colnames(expData)[3:66]

# maybe remove MDS plot
# Let's look at the MDS plot
limma::plotMDS(expData[, 3:66],
               col = c("darkgreen", "blue", "purple")[factor(samples$cohorts)],
               main = "MDS Plot Colored by Cohort"
               )
legend("bottomleft",
       legend = levels(factor(samples$cohorts)),
       pch = c(1),
       col = c("darkgreen","blue", "purple"),
       title = "Cohort",
       cex = 0.75)

# Second, create a linear model
modelDesign <- model.matrix(~samples$cohorts)
knitr::kable(head(modelDesign), type="html") # probably don't need to display
(Intercept) samples\(cohortsICM| samples\)cohortsNF
1 0 1
1 0 1
1 0 1
1 0 1
1 0 1
1 0 1
# Third, create our data matrix
expMatrix <- as.matrix(expData[, 3:66])
rownames(expMatrix) <- expData$Gene
colnames(expMatrix) <- colnames(expData[3:66])

minSet <- Biobase::ExpressionSet(assayData = expMatrix)

# Third, fit the data to our model
fit <- limma::lmFit(minSet, modelDesign)

# Apply empirical Bayes to compute differential expression for the above model
bayes <- limma::eBayes(fit, trend = TRUE)

Let’s get the top hits

topFit <- limma::topTable(bayes,
                         coef = ncol(modelDesign),
                         adjust.method = "BH",
                         number = nrow(expMatrix)) # Use BH to adjust 
# assign HGNC symfols to the topFit table
outputHits <- merge(expData[, c("Gene", "hgnc_symbol")],
                    topFit,
                    by.y=0, by.x=1,
                    all.y=TRUE)
# sort hits in increasing p values
outputHits <- outputHits[order(outputHits$P.Value),]
head(outputHits)
# knitr::kable(outputHits[1:10, 2:8], type="html", row.names=FALSE) # look at top 10 hits

Let’s see how many genes are significantly expressed

length(which(outputHits$P.Value < 0.05))
## [1] 6179

There are 6179 genes that are significantly expressed.

The thresholds chosen were 0.05 for the p-value as that is usually a standard value. Additionally, there are a fair amount of results, about half of the total genes, so the p-value threshold doesn’t need to be increased to be made less stringent.

Let’s see the amount that are after correction

length(which(outputHits$adj.P.Val < 0.05))
## [1] 5236

There is still a fair amount of genes differentially expressed pass correction at 5236 out of 12381 genes total. The thresholds chosen were 0.05 for the p-value as that is usually a standard value. Additionally, there are a fair amount of results, about half of the total genes, so the p-value threshold doesn’t need to be increased to be made less stringent.

This makes me think the authors have potentially done some filtering of the data as their data analysis procedure is not well documented since I would have expected a lesser number of genes to be significantly differentially expressed both before and after correction.

Volcano plot

plot(outputHits$logFC, -log10(outputHits$P.Value), pch = 20,
     main = "Volcano plot of differential expression in DCM, ICM, and NF",
     xlim = c(-10, 10),
     xlab = "logFC",
     ylab = "-log10(P-value)")

# color points based on if they're upregulated, down regulated or neither
# upregulated is red
upregulated <- outputHits[which(outputHits$P.Value < 0.05 & outputHits$logFC > 0), ]
with(upregulated, points(logFC, -log10(P.Value), pch = 20, col = "red"))

# make downregulated points blue
downregulated <- outputHits[which(outputHits$P.Value < 0.05 & outputHits$logFC < 0), ]
with(downregulated, points(logFC -log10(P.Value), pch = 20, col = "blue"))

legend("topright",
       legend = c("up-regulated", "down-regulated", "neither"), 
       fill = c("red","blue", "black"),
       cex = 1)
Figure X: Volcano plots of differentially expressed genes

Figure X: Volcano plots of differentially expressed genes

# ggplot2::ggplot(outputHits, x=logFC, y=P.Value) +
#   ggplot2::geom_point(aes(x=logFC, y=P.Value)) +
#   ggplot2::xlim(-10, 10)
ggplot2::ggplot(outputHits, x=logFC, y=-log10(P.Value)) +
  ggplot2::geom_point(aes(x=logFC, y=-log10(P.Value))) +
  ggplot2::xlim(-10, 10)
Figure X: Volcano plots of differentially expressed genes

Figure X: Volcano plots of differentially expressed genes

Volcano plot for statistically significant differentially expressed genes past correction.

plot(outputHits$logFC, -log10(outputHits$adj.P.Val), pch = 20,
     main = "Volcano plot of differential expression in DCM, ICM, and NF",
     xlim = c(-10, 10),
     xlab = "logFC",
     ylab = "-log10(adjusted P-value)")

# color points based on if they're upregulated, down regulated or neither
# upregulated is red
upregulated <- outputHits[which(outputHits$adj.P.Val < 0.05 & outputHits$logFC > 0), ]
with(upregulated, points(logFC, -log10(adj.P.Val), pch = 20, col = "red"))

# make downregulated points blue
downregulated <- outputHits[which(outputHits$adj.P.Val < 0.05 & outputHits$logFC < 0), ]
with(downregulated, points(logFC -log10(adj.P.Val), pch = 20, col = "blue"))

legend("topright",
       legend = c("up-regulated", "down-regulated", "neither"), 
       fill = c("red","blue", "black"),
       cex = 1)

## Heatmap Let’s make a heatmap with the statistically significant genes both before and after correction.

# Get genes which show significant differential expression
sigGenes <- outputHits$Gene[which(outputHits$P.Value < 0.05)]
sigExp <- expData[which(expData$Gene %in% sigGenes), ]
sigGeneNames <- sigExp$Gene
sigExp <- as.matrix(sigExp[, 3:66])
rownames(sigExp) <- sigGeneNames

# scale each gene and centre around the mean for row normalization
heatmapMatrix <- t(scale(t(sigExp)))

# assign heatmap colors
if (min(heatmapMatrix) == 0) {
  heatmapCol <- circlize::colorRamp2(c(0, max(heatmapMatrix)),
                           c("white", "red"))
} else {
    heatmapCol <- circlize::colorRamp2(c(min(heatmapMatrix), 0, max(heatmapMatrix)),
                             c("blue", "white", "red"))
}

# make the heatmap
heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmapMatrix),
                                   show_row_dend = TRUE,
                                   show_column_dend = TRUE,
                                   col = heatmapCol,
                                   show_column_names = TRUE,
                                   show_row_names = FALSE, 
                                   show_heatmap_legend = TRUE,
                                   column_order = 
                                     stringr::str_sort(colnames(heatmapMatrix),
                                                       numeric = TRUE),
                                   column_names_gp = gpar(fontsize = 7),
                                   column_names_rot = 90,
                                   name = "Expression level",
                                   column_title = "Sample",
                                   row_title = "Genes")
heatmap

After correction:

# Get genes which show significant differential expression
sigGenesAfterCorrection <- outputHits$Gene[which(outputHits$adj.P.Val < 0.05)]
sigExpAfterCorrection <- expData[which(expData$Gene %in% sigGenesAfterCorrection), ]
sigAfterCorrectionGeneNames <- sigExpAfterCorrection$Gene
sigExpAfterCorrection <- as.matrix(sigExpAfterCorrection[, 3:66])
rownames(sigExpAfterCorrection) <- sigAfterCorrectionGeneNames

# scale each gene and centre around the mean for row normalization
heatmapMatrixAfterCorrection <- t(scale(t(sigExpAfterCorrection)))

# assign heatmap colors
if (min(heatmapMatrixAfterCorrection) == 0) {
  heatmapColAfterCorrection <- circlize::colorRamp2(c(0, max(heatmapMatrixAfterCorrection)),
                           c("white", "red"))
} else {
    heatmapColAfterCorrection <- circlize::colorRamp2(c(min(heatmapMatrixAfterCorrection), 0, max(heatmapMatrixAfterCorrection)),
                             c("blue", "white", "red"))
}

# make the heatmap
heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmapMatrixAfterCorrection),
                                   show_row_dend = TRUE,
                                   show_column_dend = TRUE,
                                   col = heatmapCol,
                                   show_column_names = TRUE,
                                   show_row_names = FALSE, 
                                   show_heatmap_legend = TRUE,
                                   column_order = 
                                     stringr::str_sort(colnames(heatmapMatrixAfterCorrection),
                                                       numeric = TRUE),
                                   column_names_gp = gpar(fontsize = 7),
                                   column_names_rot = 90,
                                   name = "Expression level",
                                   column_title = "Sample",
                                   row_title = "Genes")
heatmap

Both before and after correction they look almost identical.

Threshold over-representation analysis

Since the heatmap and the number of genes past multiple hypothesis correction are similar to the heatmap and amount of genes before multiple hypothesis testing I will use the genes shown to be statistically significant before multiple hypothesis testing to keep more results.

With the significantly up-regulated and down-regulated genes let’s run a threshold gene set enrichment analysis. G:Profiler will be used.

First let’s divide the genes into up and down-regulated.

# First divide the genes into up-regulated and down-regulated
sigExp <- outputHits[which(outputHits$P.Value < 0.05), ]
upRegulatedGenes <- sigExp[which(sigExp$logFC > 0), ]
downRegulatedGenes <- sigExp[which(sigExp$logFC < 0), ]

Next, we will use G:Profiler

# Let's examine the version of annotation data used in G:Profiler
versionInfo <- gprofiler2::get_version_info(organism = "hsapiens")

versionDF <- as.data.frame(unlist(versionInfo$sources))
names(versionDF)[1] <- "version"
versionDF[, "annotation_data"] <- row.names(versionDF)

# only keep those from GO Biological Process, KEGG, Reactome, and Wikipathways
versionDF <- versionDF[versionDF$annotation_data %in% c("GO:BP.version", "KEGG.version", "REAC.version", "WP.version"), ]

# Reformat the annotation data version table 
versionDF <- versionDF[, c(2, 1)]
versionDF$annotation_data <- c("GO:BP", "KEGG", "REAC", "WP")
row.names(versionDF) <- 1:4
head(versionDF)

Let’s now begin the ORA

# ORA for all significantly differentially expressed genes 
gostAll <- gprofiler2::gost(query = sigExp$Gene, 
                            sources = c("GO:BP", "KEGG", "REAC", "WP"),
                            significant = FALSE,
                            user_threshold = 0.05,
                            correction_method = "fdr",
                            exclude_iea = TRUE)
# ORA for up-regulated genes
gostUp <- gprofiler2::gost(query = upRegulatedGenes$Gene,
                           sources = c("GO:BP", "KEGG", "REAC", "WP"),
                            significant = FALSE,
                            user_threshold = 0.05,
                            correction_method = "fdr",
                            exclude_iea = TRUE)
# ORA for down-regulated genes
gostDown <- gprofiler2::gost(query = downRegulatedGenes$Gene,
                           sources = c("GO:BP", "KEGG", "REAC", "WP"),
                            significant = FALSE,
                            user_threshold = 0.05,
                            correction_method = "fdr",
                            exclude_iea = TRUE)

Let’s see how many genesets were returned with the threshold 0.05 and what the top results are.

# Both up-regulated and down-regulated (all genes)
nrow(gostAll$result)
## [1] 13595
head(gostAll$result[order(gostAll$result$p_value, decreasing = FALSE),])
# Just up-regulated
nrow(gostUp$result)
## [1] 11182
head(gostUp$result[order(gostUp$result$p_value, decreasing = FALSE),])
# Just down-regulated
nrow(gostDown$result)
## [1] 11374
head(gostDown$result[order(gostDown$result$p_value, decreasing = FALSE),])

From this we see there are over 10000 genesets returned in each ORA and the top terms returned are very general such as metabolic process. To combat this we will filter the results to get more specific results.

# Filter the results to only those with term size between 1 and 200 inclusive
gostAllFilteredResults <- gostAll$result[which(gostAll$result$term_size <= 200 & 
                                                 gostAll$result$term_size >= 1), ]
nrow(gostAllFilteredResults)
## [1] 12658
nrow(gostAll$result) - nrow(gostAllFilteredResults) # how many results removed
## [1] 937
head(gostAllFilteredResults[order(gostAllFilteredResults$p_value, decreasing = FALSE),])

We now see the results are more specific with the top term being mRNA splicing and 937 terms were removed leaving 12658 for ORA done on both up and down-regulated genes.

# Filter the results to only those with term size between 1 and 200 inclusive
gostUpFilteredResults <- gostUp$result[which(gostUp$result$term_size <= 200 & 
                                                 gostUp$result$term_size >= 1), ]
nrow(gostUpFilteredResults)
## [1] 10245
nrow(gostUp$result) - nrow(gostUpFilteredResults) # how many results removed
## [1] 937
head(gostUpFilteredResults[order(gostUpFilteredResults$p_value, decreasing = FALSE),])

For up-regulated genes 937 terms were removed leaving 10245 terms and the top one being mitochondrial gene expression.

# Filter the results to only those with term size between 1 and 200 inclusive
gostDownFilteredResults <- gostDown$result[which(gostDown$result$term_size <= 200 & 
                                                 gostDown$result$term_size >= 1), ]
nrow(gostDownFilteredResults)
## [1] 10438
nrow(gostDown$result) - nrow(gostDownFilteredResults) # how many results removed
## [1] 936
head(gostDownFilteredResults[order(gostDownFilteredResults$p_value, decreasing = FALSE),])

There were 936 terms removed leaving 10438 in down-regulated genes and the top term is now diseases associated with glycosaminoglycan metabolism.

Let’s visualize the results with Manhattan plots. Note the terms above the line are considered significant after correction.

gprofiler2::gostplot(gostAll)

Figure X: Manhattan Plot of ORA results for all differentially expressed genes

gprofiler2::gostplot(gostUp)

Figure X: Manhattan Plot of ORA results for up-regulated genes

gprofiler2::gostplot(gostDown)

Figure X: Manhattan Plot of ORA results for down-regulated genes

Interpretation and discussion questions

  1. Do the over-representation results support conclusions or mechanism discussed in the original paper?

Let’s see the top result for all genes, up-regulated, and down-regulated based on the annotation source used.

# All genes
# Get top result by annotation source used
allTopResults <- gostAllFilteredResults[which(gostAllFilteredResults$source == "GO:BP"), ][1, ]
allTopResults <- rbind(allTopResults,
                       gostAllFilteredResults[which(gostAllFilteredResults$source == "KEGG"), ][1, ])
allTopResults <- rbind(allTopResults,
                       gostAllFilteredResults[which(gostAllFilteredResults$source == "REAC"), ][1, ])
allTopResults <- rbind(allTopResults,
                       gostAllFilteredResults[which(gostAllFilteredResults$source == "WP"), ][1, ])
row.names(allTopResults) <- c(1:4)
allTopResults <- subset(allTopResults, select = c(9:11))
allTopResults
# Up-regulated genes
# Get top result by annotation source used
upTopResults <- gostUpFilteredResults[which(gostUpFilteredResults$source == "GO:BP"), ][1, ]
upTopResults <- rbind(upTopResults,
                       gostUpFilteredResults[which(gostUpFilteredResults$source == "KEGG"), ][1, ])
upTopResults <- rbind(upTopResults,
                       gostUpFilteredResults[which(gostUpFilteredResults$source == "REAC"), ][1, ])
upTopResults <- rbind(upTopResults,
                       gostUpFilteredResults[which(gostUpFilteredResults$source == "WP"), ][1, ])
row.names(upTopResults) <- c(1:4)
upTopResults <- subset(upTopResults, select = c(9:11))
upTopResults
# Up-regulated genes
# Get top result by annotation source used
downTopResults <- gostDownFilteredResults[which(gostDownFilteredResults$source == "GO:BP"), ][1, ]
downTopResults <- rbind(downTopResults,
                       gostDownFilteredResults[which(gostDownFilteredResults$source == "KEGG"), ][1, ])
downTopResults <- rbind(downTopResults,
                       gostDownFilteredResults[which(gostDownFilteredResults$source == "REAC"), ][1, ])
downTopResults <- rbind(downTopResults,
                       gostDownFilteredResults[which(gostDownFilteredResults$source == "WP"), ][1, ])
row.names(downTopResults) <- c(1:4)
downTopResults <- subset(downTopResults, select = c(9:11))
downTopResults
  1. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.

https://bmccardiovascdisord.biomedcentral.com/articles/10.1186/s12872-020-01596-w https://www.nature.com/articles/s41598-022-26277-w#Sec12

Differential Gene Expression

  1. Calculate p-values for each of the genes in your expression set. How many genes were significantly differentially expressed? What thresholds did you use and why?
  2. Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why? How many genes passed correction?
  3. Show the amount of differentially expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest.
  4. Visualize your top hits using a heatmap. Do you conditions cluster together? Explain why or why not.

Threshold over-representation analysis

  1. Which method did you choose and why?
  2. What annotation data did you use and why? What version of the annotation are you using?
  3. How many genesets were returned with what thresholds?
  4. Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?

The link to the corresponding journal entry can be found here…

Appendix

Here I decided to regenerate the density plot for the data. In Assignment 1 it was noted that the data has been filtered to remove genes whose mean RPKM across all samples was less than 5, as the authors did in their paper. This is to remove some noise.

Note, in Assignment 1 the density plots before filtering also looked strange as a lot of the density was at 0 with a very long tail. This is an indication that I was applying a log to a log since in that original plot I did log2(RPKM + 1). The reason I decided to plot log2(RPKM + 1) was because in the paper the authors did not mention any other methods applied to the data other than normalizing it with RPKM. The authors must have done something else to their data other than applying RPKM that they did not document.

I have decided to redo the density plots without using log2 and with the data filtered to only include genes who have mean RPKM greater than 5 across all samples.

filteredExpData <- dplyr::select(expData, -c(2, 67, 68, 69, 70, -71)) 
pivotedData <- tidyr::pivot_longer(filteredExpData, cols = !Gene, names_to = "sample", values_to = "rpkm")

pivotedData$cohort <- gsub("[0-9]", "", pivotedData$sample)
nfSamples <- dplyr::filter(pivotedData, pivotedData$cohort == "NF")
dcmSamples <- dplyr::filter(pivotedData, pivotedData$cohort == "DCM")
icmSamples <- dplyr::filter(pivotedData, pivotedData$cohort == "ICM")

nfDensityPlot <- ggplot2::ggplot(nfSamples, aes(x=rpkm, color=sample)) +
  ggplot2::geom_density(key_glyph = draw_key_smooth) +
  ggplot2::scale_x_continuous(limits = c(-2, 12)) +
  ggplot2::labs(title = "Smoothing density of RPKM from NF samples",
                x = "RPKM",
                y = "Smoothing density of RPKM",
                color = "Sample")
nfDensityPlot

dcmDensityPlot <- ggplot2::ggplot(dcmSamples, aes(x=rpkm, color=sample)) +
  ggplot2::geom_density(key_glyph = draw_key_smooth) + 
  scale_x_continuous(limits = c(-2, 12)) +
  ggplot2::labs(title = "Smoothing density of RPKM from DCM samples",
                x = "RPKM",
                y = "Smoothing density of RPKM",
                color = "Sample")
dcmDensityPlot

icmDensityPlot <- ggplot2::ggplot(icmSamples, aes(x=rpkm, color=sample)) +
  ggplot2::geom_density(key_glyph = draw_key_smooth) + 
  scale_x_continuous(limits = c(-2, 12)) +
  ggplot2::labs(title = "Smoothing density of RPKM from ICM samples",
                x = "RPKM",
                y = "Smoothing density of RPKM",
                color = "Sample")
icmDensityPlot

Figure: X, updated density plot

Although the data does not look perfect this looks better than the previous density plot from pre-filtering for mean RPKM less than 5 and plotting log2(RPKM + 1). In the later it appeared all the density was near zero and there was a long tail.

For brevity the code and post normalized plot has not been included in this report as it is not the main focus of this assignment. Assignment 1 showing the previous pre-normalization and post normalization plots can be found at: https://github.com/bcb420-2023/Angela_Ng/blob/main/A1/A1_data_set_selection_initial_processing.html.

References

R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Sweet, Mary E, Andrea Cocciolo, Dobromir Slavov, Kenneth L Jones, Joseph R Sweet, Sharon L Graw, T Brett Reece, et al. 2018. “Transcriptome Analysis of Human Heart Failure Reveals Dysregulated Cell Adhesion in Dilated Cardiomyopathy and Activated Immune Pathways in Ischemic Heart Failure.” BMC Genomics 19: 1–14.